Data Mining - Computer Assignment 1

Mohammad Saadati - 810198410

Introduction

The goal of this computer assignment is to review a COVID-19 data set, pre-process it, visualize it, and present our analysis based on those visualizations. The data set contains daily, per-country information about patients, new tests and patient deaths, together with several features related to each country.

Import Libraries

First of all, we import the libraries whose functions we need.

Preprocessing

Pre-processing is one of the most important steps in data mining projects. Various approaches exist for handling missing data and for converting data to other formats, and the careful selection of these approaches has a direct impact on the quality of the final results. Therefore, the best approach should always be identified and applied.

First we load the CSV file as a DataFrame using the pandas library.

Question 1

In this part, we first use two pandas functions (isna and sum) to count the number of missing values in each column.
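
As a minimal sketch of this step, the snippet below counts missing values per column. The toy DataFrame and its values are made up for illustration and only stand in for the real COVID-19 data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the COVID-19 DataFrame (values are invented for illustration).
df = pd.DataFrame({
    "location": ["Iran", "Iran", "Italy", "Italy"],
    "new_cases": [10.0, np.nan, 5.0, np.nan],
    "new_deaths": [1.0, 0.0, np.nan, 2.0],
})

# isna() marks every missing cell as True; sum() then counts the True values per column.
missing_per_column = df.isna().sum()
print(missing_per_column)
```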

Now we use the fillna function to fill in the missing values in each column.

To fill in the missing values in numeric columns:

To fill in the missing values in the continent column:

To fill in the missing values in the tests_units column:

Note: for the tests_units column, we first tried to fill in the missing values with the mode of the corresponding location, but since this took a long time, we did not use this method.
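
The following is only a sketch of how such filling could look, not necessarily the code used in this assignment. It assumes numeric columns are filled with the column mean, and it shows one possible way to make the per-location mode cheaper: compute one mode per location and map it back to the rows. The helper first_mode, the global-mode fallback and the toy data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (invented); Peru has no recorded tests_units at all.
df = pd.DataFrame({
    "location": ["Iran", "Iran", "Iran", "Italy", "Italy", "Peru"],
    "new_cases": [10.0, np.nan, 20.0, 4.0, np.nan, 6.0],
    "tests_units": ["tests performed", None, "tests performed",
                    "people tested", None, None],
})

# Numeric columns: replace NaN with the column mean.
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

def first_mode(values):
    """Hypothetical helper: most frequent non-missing value, or NaN if there is none."""
    modes = values.mode()  # mode() ignores missing values
    return modes.iloc[0] if not modes.empty else np.nan

# One mode per location (computed once per group), then mapped back onto the rows.
mode_per_location = df.groupby("location")["tests_units"].agg(first_mode)
df["tests_units"] = df["tests_units"].fillna(df["location"].map(mode_per_location))

# Locations without any recorded unit fall back to the global mode.
df["tests_units"] = df["tests_units"].fillna(df["tests_units"].mode().iloc[0])
print(df)
```

Computing the mode once per location avoids a row-by-row lookup, which is usually where such an approach becomes slow.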

Advantages and disadvantages of using mean for missing values:

Missing values in your data do not reduce your sample size, as would be the case with listwise deletion (the default of many statistical software packages, e.g. R, Stata, SAS or SPSS). Since mean imputation replaces all missing values, you can keep your whole database.

Replacing missing data with the mean of the non-missing data causes the population SD to be underestimated, and may also obscure important features of the population from which the data were sampled. Another possible disadvantage of using the mean for missing values is that the reason the values are missing in the first place could depend on the missing values themselves. (This is called missing not at random.)

Advantages: Easy to apply - The column mean does not change

Disadvantages: Results may not be accurate - With a large number of NaN values, the imputed mean becomes by far the most frequent value (the mode), which distorts the distribution and can cause errors in the results - The variance decreases
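
A small numeric check of these two effects, using made-up values: after mean imputation the mean stays the same, while the sample variance shrinks.

```python
import numpy as np
import pandas as pd

# Made-up values with two missing entries.
values = pd.Series([2.0, 4.0, np.nan, 6.0, np.nan, 8.0])

# Mean imputation: every NaN is replaced with the mean of the observed values.
imputed = values.fillna(values.mean())

# pandas skips NaN when computing mean() and var() on the original series.
mean_before, mean_after = values.mean(), imputed.mean()
var_before, var_after = values.var(), imputed.var()

print(f"mean: {mean_before} -> {mean_after}")
print(f"variance: {var_before:.3f} -> {var_after:.3f}")
```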

Question 2

In this section, we build a new dataframe called df_p1_q2 that holds, aggregated per country, the number of new_cases, the number of new_vaccinations, the number of new_deaths and the population.

We use the groupby function to aggregate: the sum() method for the new columns and the max() method for the total columns.
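
A minimal sketch of this aggregation on toy data. The numbers are invented, and taking max() for population is an assumption (it is constant within a country, so any of max, min or first would give the same value).

```python
import pandas as pd

# Toy daily rows (invented values).
df = pd.DataFrame({
    "location": ["Iran", "Iran", "Italy", "Italy"],
    "new_cases": [10, 20, 5, 7],
    "new_deaths": [1, 2, 0, 1],
    "new_vaccinations": [100, 150, 80, 90],
    "total_cases": [10, 30, 5, 12],
    "population": [85_000_000, 85_000_000, 60_000_000, 60_000_000],
})

# Daily "new_" columns are summed; cumulative "total_" columns take their maximum.
df_p1_q2 = df.groupby("location").agg(
    new_cases=("new_cases", "sum"),
    new_deaths=("new_deaths", "sum"),
    new_vaccinations=("new_vaccinations", "sum"),
    total_cases=("total_cases", "max"),
    population=("population", "max"),  # assumption: constant per country
).reset_index()
print(df_p1_q2)
```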

Question 3

In this part, we use the jdatetime.date.fromgregorian function from the jdatetime library to convert each Gregorian date to a Shamsi date and store the result in the shamsi_date column of a new dataframe called df_p1_q3.

Question 4

Yes. For example, the total columns can be dropped, because their values can be recovered by cumulatively summing the corresponding new columns.

Redundant attributes can often be detected by correlation analysis and covariance analysis.
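
As an illustration on toy data, the sketch below checks whether a total column is redundant by rebuilding it from the matching new column with a per-country cumulative sum, and also prints a correlation matrix. The values are invented, and real data with gaps or corrections may not match exactly.

```python
import pandas as pd

# Toy data (invented): total_cases is the running total of new_cases per location.
df = pd.DataFrame({
    "location": ["Iran"] * 4 + ["Italy"] * 4,
    "new_cases": [10, 20, 0, 5, 3, 4, 8, 1],
    "total_cases": [10, 30, 30, 35, 3, 7, 15, 16],
})

# Rebuild the cumulative column from the daily column, separately for each location.
rebuilt_total = df.groupby("location")["new_cases"].cumsum()

# If the rebuilt values match, total_cases carries no extra information.
is_redundant = bool((rebuilt_total == df["total_cases"]).all())
print("total_cases is redundant:", is_redundant)

# Pearson correlation matrix of the numeric columns, another common redundancy check.
print(df[["new_cases", "total_cases"]].corr())
```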

Question 5

We create a new dataframe that contains only the data for Iran and call it df_iran.

Question 6

In this part, we add a new column called shamsi_month to df_iran, which stores the Shamsi month of each row as a separate feature.

Question 7

In this part, we create a new dataframe that aggregates the Iran dataframe by month in 2021 and call it df_p1_q7.

We use the groupby function to aggregate: the sum() method for the new columns and the max() method for the total columns.

Display data

One of the most widely used tools in data mining is data visualization, which helps us understand the data set and supports a more complete analysis based on the resulting charts.

Question 1

In this section, using a bar chart, we identify the countries that have had the best and worst performance (number of deaths relative to the total population) in controlling the coronavirus.
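
A sketch of how the ranking behind such a bar chart could be computed. The country names and numbers are hypothetical, and the column name deaths_per_population is an assumption introduced here.

```python
import pandas as pd

# Hypothetical per-country totals (invented values).
df_p1_q2 = pd.DataFrame({
    "location": ["A", "B", "C", "D"],
    "new_deaths": [100, 5000, 50, 1200],
    "population": [1_000_000, 2_000_000, 5_000_000, 3_000_000],
})

# Performance metric: deaths relative to the total population.
df_p1_q2["deaths_per_population"] = df_p1_q2["new_deaths"] / df_p1_q2["population"]

# Lowest ratios are the best performers, highest ratios the worst.
best = df_p1_q2.nsmallest(2, "deaths_per_population")
worst = df_p1_q2.nlargest(2, "deaths_per_population")
print(best[["location", "deaths_per_population"]])
print(worst[["location", "deaths_per_population"]])
```

The best and worst frames can then be passed to any bar-chart function.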

Question 2

In this part, we want to investigate the effect of vaccination on the number of deaths. To do this, for each country, we calculate the number of new_vaccinations per population of that country, and after sorting the obtained values, we select every fifth country for review and comparison.

Now we select the desired countries.

According to the above chart, increasing the number of vaccinations does not necessarily reduce the number of deaths; factors other than the number of vaccinations also affect the number of deaths.

Question 3

In this part, we examine the speed of vaccination in different countries. To do this, we sort the values in the new_vaccinations column, and after sorting, we select every fifth country for review and comparison.

Now we select the desired countries.

First we aggregate the country data by gregorian_month in 2021, then we use a scatter plot to check the rate of vaccination.

European countries with smaller populations have had almost constant vaccination rates. Countries with larger populations saw their vaccination rates increase over a period of time and then decrease with the advent of new variants of the coronavirus. Less developed countries also have almost constant vaccination rates.
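
A minimal sketch of the monthly aggregation, assuming a date column that can be parsed by pandas. The rows are invented, and the plotting step is left out.

```python
import pandas as pd

# Toy daily rows (invented values); one row falls outside 2021.
df = pd.DataFrame({
    "location": ["Iran", "Iran", "Iran", "Iran", "Italy", "Italy"],
    "date": ["2020-12-31", "2021-01-10", "2021-01-20",
             "2021-02-05", "2021-01-15", "2021-02-01"],
    "new_vaccinations": [5, 10, 20, 40, 7, 9],
})
df["date"] = pd.to_datetime(df["date"])

# Keep only 2021 and derive the Gregorian month of each row.
df_2021 = df[df["date"].dt.year == 2021].copy()
df_2021["gregorian_month"] = df_2021["date"].dt.month

# Sum the daily vaccinations per country and month.
monthly = df_2021.groupby(
    ["location", "gregorian_month"], as_index=False
)["new_vaccinations"].sum()
print(monthly)
```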

Question 4

In this section, we examine the trend in the strictness of coronavirus restrictions in Iran. To do this, we use the stringency_index feature.

Before and after Nowruz, we see the highest strictness. After a while the strictness decreases, and then, with the arrival of new strains of the coronavirus, it increases again.

Question 5

In this part, we use the box function from the plotly.express library to draw a box plot that characterizes the number of deaths in each country.

Based on the box plot, South America, United States, Brazil, India, Russia, Mexico, Africa and Peru are outliers (South America and Africa are continent-level aggregates in the data rather than countries), and we remove them from our data for analysis.

According to the median and mean values, the box plot is skewed: most of the values are concentrated at the lower end.
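
Outliers like these can also be found numerically. The sketch below applies the common 1.5 × IQR whisker rule to invented values; the quartile method used by a plotting library may differ slightly, so the result is not guaranteed to match the plot exactly.

```python
import pandas as pd

# Invented death counts for placeholder locations A..H.
deaths = pd.Series(
    [120, 150, 90, 200, 170, 110, 5000, 130],
    index=["A", "B", "C", "D", "E", "F", "G", "H"],
    name="new_deaths",
)

# Quartiles and interquartile range.
q1, q3 = deaths.quantile(0.25), deaths.quantile(0.75)
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are treated as outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = deaths[(deaths < lower) | (deaths > upper)]
kept = deaths[(deaths >= lower) & (deaths <= upper)]
print("outliers:", outliers.index.tolist())
```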

Question 6

In this section, we use scatter plots to analyze the impact of the features.

According to the plots, the population_density, median_age, handwashing_facilities, hospital_beds_per_thousand and human_development_index features all have almost the same effect on the new_deaths and new_cases features.

Question 7

In this section, we examine the relationship between the economic situation of countries and the number of people vaccinated.

Countries with a more stable economic situation have more people vaccinated.

Question 8

In this part, we examine the distribution of the number of patients by month in 2021.

The distribution of patients by month looks roughly sinusoidal: the number of patients per month rises and falls with the global coronavirus situation, such as the emergence of new variants and the level of vaccination, so there is no steady upward or downward trend.

Bonus questions

Question 1

In this part, we show on a map the number of deaths over the last three months in different countries relative to their population. To do this, we use the geopandas library.

Question 2

In this section, we sum the number of deaths and the number of vaccinations in Iran on a weekly basis. Then we plot them in a suitable chart.

According to the graphs, in some weeks the number of deaths increased with the arrival of a new coronavirus wave, combined with low or no vaccination, especially in the first year of the pandemic. In other weeks, with the virus relatively contained and vaccination increasing, we see fewer deaths.
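
One way to do the weekly summation is with resample, as sketched below on invented data. The default "W" rule groups by weeks ending on Sunday, which is an assumption about how the weeks should be delimited.

```python
import pandas as pd

# Two weeks of invented daily data, starting on Monday 2021-03-01.
dates = pd.date_range("2021-03-01", periods=14, freq="D")
df_iran = pd.DataFrame({
    "date": dates,
    "new_deaths": [1] * 7 + [2] * 7,
    "new_vaccinations": [10] * 7 + [20] * 7,
})

# resample("W") bins the rows into calendar weeks (ending on Sunday) and sums each bin.
weekly = (
    df_iran.set_index("date")[["new_deaths", "new_vaccinations"]]
    .resample("W")
    .sum()
)
print(weekly)
```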